Poster: pabouk Date: Jul 28, 2008 6:16am
Forum: general Subject: The archive returns wrong Content-Type HTTP header

Hello,
several times I have seen the web archive return a bad "Content-Type" HTTP header that declares the wrong character set. Examples:

$ wget -S http://web.archive.org/web/20030524081559/http://www.iriverjapan.com/product.php?product=iHP-100
--2008-07-28 15:23:00--  http://web.archive.org/web/20030524081559/http://www.iriverjapan.com/product.php?product=iHP-100
Resolving web.archive.org... 207.241.227.154
Connecting to web.archive.org|207.241.227.154|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Mon, 28 Jul 2008 13:23:01 GMT
Server: Apache/2.2.4 (Ubuntu) PHP/5.2.3-1ubuntu6 mod_perl/2.0.2 Perl/v5.8.8
X-Powered-By: PHP/4.2.3
Content-Type: text/html; charset=UTF-8
Connection: close
Length: unspecified [text/html]

$ wget -S http://web.archive.org/web/20050518010425/http://www.didaktik.cz/pocitace_didaktik/didaktik_8.htm
--2008-07-28 15:33:53--  http://web.archive.org/web/20050518010425/http://www.didaktik.cz/pocitace_didaktik/didaktik_8.htm
Resolving web.archive.org... 207.241.227.154
Connecting to web.archive.org|207.241.227.154|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Mon, 28 Jul 2008 13:33:54 GMT
Server: Apache/2.2.4 (Ubuntu) PHP/5.2.3-1ubuntu6 mod_perl/2.0.2 Perl/v5.8.8
Accept-Ranges: bytes
ETag: "306df66a864ec51:1695"
Last-Modified: Sun, 01 May 2005 19:46:07 GMT
Content-Length: 8750
Content-Type: text/html; charset=UTF-8
Connection: close
Length: 8750 (8.5K) [text/html]

In the first case the archive returns "Content-Type: text/html; charset=UTF-8" although the archived page uses the "x-sjis" charset, as its meta headers indicate.

In the second case the archive again returns "Content-Type: text/html; charset=UTF-8" (does it always?), although the page is in the "Windows-1250" charset, which is not indicated by any meta headers.
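To illustrate what goes wrong in the second case, here is a minimal Python sketch. The sample sentence is a made-up Czech phrase, not text from the archived page; it only shows that bytes written in Windows-1250 (Python codec name "cp1250") cannot be read back correctly when the client is told they are UTF-8.

```python
# Hypothetical sample text; NOT taken from the archived page.
text = "Příliš žluťoučký kůň"

# The bytes as a Windows-1250 page would store them.
raw = text.encode("cp1250")

# What a client gets if it trusts "charset=UTF-8" from the HTTP header.
# Invalid byte sequences are replaced with U+FFFD instead of raising.
as_utf8 = raw.decode("utf-8", errors="replace")

# What a client gets if it uses the real charset.
as_cp1250 = raw.decode("cp1250")

print("decoded as UTF-8:       ", as_utf8)
print("decoded as Windows-1250:", as_cp1250)
```

The accented letters survive only in the second decoding; in the first they turn into replacement characters.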

In both cases a better result would be obtained by omitting the "charset=UTF-8" part. Do you know why the archive wrongly asserts the UTF-8 character set? Unfortunately, the charset given in the HTTP header takes priority over the one declared in the page's meta headers. Does the archive store the original HTTP headers?
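The priority rule can be sketched in Python as follows. This is a simplified illustration, not how any particular browser or the archive is implemented: the function name effective_charset and the regular expression are my own assumptions, and real clients also consider byte order marks and content sniffing.

```python
import re
from email.message import Message

# Simplified pattern for a charset declared inside a <meta> element.
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)',
                          re.IGNORECASE)

def effective_charset(content_type, body):
    """Return (charset, source) the way a header-first client would pick it.

    Hypothetical helper: the HTTP header wins, the meta declaration is
    only consulted when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    header_charset = msg.get_content_charset()  # lowercased, or None
    if header_charset:
        return header_charset, "http-header"
    match = META_CHARSET.search(body)
    if match:
        return match.group(1).decode("ascii").lower(), "meta"
    return None, "unknown"

page = b'<meta http-equiv="Content-Type" content="text/html; charset=x-sjis">'

# With the header the archive sends, the meta declaration is ignored.
print(effective_charset("text/html; charset=UTF-8", page))
# Without the charset parameter, the page's own declaration is used.
print(effective_charset("text/html", page))
```

This is why dropping "charset=UTF-8" from the header would help in the first example: the client would then fall back to the page's own "x-sjis" declaration.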